A Similarity Graph of Distances

“To understand is to perceive patterns.”

-Isaiah Berlin, The proper study of mankind, p 129

Introduction

Much has been written and published about combining datasets with the same set of observations (rows) but different features (columns) - especially in the biological realm where OMIC technology has made it feasible to measure multitudes of different biological markers from the same set of samples - but the literature often stops short of supplying a comprehensive solution. That is, it makes abundant sense to try and combine several datasets with overlapping qualities to gain a more holistic view of the observations, but the datasets themselves often require multiple views to make sense of the patterns within. Or, in more abstract terms, combining patterns that do not have their own patterns sorted out and accentuated will yield sub-optimal results. As always, garbage-in-garbage-out.

PreciseDist seeks to provide a comprehensive solution for the above problem by making it feasible to provide many views of a single dataset that can then be combined with many views of a separate dataset where each view is a distance. The question then arises, however, which views should be combined? At the level of combining several datasets into a single view, the answer to this question is usually clear using domain-expertise. For example, it makes perfect sense to try and combine gene expression data with methylation data because the datasets are at the same time disparate and reliant. At the level of the single dataset though where each view is a distance, which distances should be combined?

Thus, this vignette is an attempt to help the user decide which distances should be combined by making the relationships between different distances self-evident. And, while we only calculate and then show the relationship between distances for a single dataset, this methodology can easily be scaled to multiple datsets by simply adding them in before the call to precise_correlations() (using, for example, the append() function if the distances are in list format or rbind() if the distance are in data frame format).

Data set-up

The data and set-up comes from the Cell Cycle Vignette, so see that vignette for more details. However, while in that vignette we set the precise_dist() dists parameter as dists = “static_dists”, here we are going to calculate every possible distance by setting dists = “all_dists”:

library(PreciseDist)
data("data_cell_cycle")
library(tibble)
as_tibble(data_cell_cycle)
cell_cycle_data <- data_cell_cycle %>%
  dplyr::select(-Cell_cycle) %>%
  as.matrix()
cell_cycle_labels <- data_cell_cycle %>%
  dplyr::select(Cell_cycle) %>%
  as.matrix()
cell_cycle_all_dists <- cell_cycle_data %>%
  as.matrix() %>%
  precise_dist(
    dists = "all_dists",
    dist_funcs = NULL,
    time_series = FALSE,
    partitions = 1,
    suffix = "Cell_Cycle_",
    file = "/absolute_path/to_somewhere/with_full_name/including_the/file_extension.rds",
    parallel = TRUE,
    local_timeout = Inf,
    verbose = TRUE
  )

Now that we have our 94 distances/similarities/correlations, we will enforce them to be all distances, which is what precise_correlations() is expecting as input:

cell_cycle_all_distances <- cell_cycle_all_dists%>%
  precise_transform(enforce_dist = TRUE) %>%
  precise_transform(transform = "range01") %>%
  precise_transform(remove_errors = TRUE)

#> [1] "Returning a list"
#> [1] "Returning a list"

Now that our distances are in the correct format, we will create a correlation matrix of distances:

library(future)
library(doFuture)
options(future.globals.maxSize = +Inf)
registerDoFuture()
plan(multicore, workers = 10)
cell_cycle_dist_cors <- cell_cycle_all_distances %>%
  precise_correlations(
    method = "pearson",
    permutations = 50,
    parallel = TRUE,
    verbose = FALSE
  )

We can view the correlation matrix with heatmaply():

library(heatmaply)
heatmaply(cell_cycle_dist_cors$statistic)

While the heatmap is a nice view, it would be nice to have other views of the relationships between distances as well. First, we will transform the correlations into distances and then create the graph:

cell_cycle_correlation_distances <- cell_cycle_dist_cors$statistic %>%
  precise_transform(enforce_dist = TRUE)
cell_cycle_correlation_graph <- precise_graph(
  data = cell_cycle_correlation_distances,
  method = 3,
  n_neighbors = 15,
  spread = 1,
  min_dist = 0.01,
  bandwidth = 1,
  parallel = FALSE,
  verbose = FALSE
)

And, now we will use trellis_viz() to see multiple different graphing possibilities at once:

trellis_viz(
  data = cell_cycle_correlation_graph,
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "Graphs_and_Plots",
  plot_type = "2d_graphs",
  k = 5,
  color_vec = rownames(cell_cycle_dist_cors$statistic),
  colors = NULL,
  size = 0.5,
  self_contained = TRUE,
  group = "common",
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 2,
  ncol = 3,
  parallel = FALSE,
  verbose = TRUE
)

We now have several different pleasing views of how our distances are related to each other. We can, of course, choose our distances now manually by looking at the relationships between distances using the correlation matrix or the graph. At this point though, we have already created the graph, so we will now cluster it to see how various clustering algorithms decide which distances are most similar:

library(future)
library(doFuture)
registerDoFuture()
plan(multiprocess, workers = 10)
cell_cycle_cor_clusters <- precise_cluster(
  cell_cycle_correlation_graph,
  cluster_alg = c("all"),
  parallel = TRUE,
  verbose = FALSE
)

Let’s add the names of the different distances to the different clustering results. We have never mixed up the row names while running nay of our algorithms, so we can simply extract the row names from cell_cycle_dist_cors$statistic and column bind them to the clustering results. We name the output “descriptors” because that is exactly what we are going to use them as below:

library(tibble)
descriptors <- cell_cycle_dist_cors$statistic %>%
  rownames() %>%
  as_tibble() %>%
  select(Distance = value) %>%
  cbind(cell_cycle_cor_clusters)
descriptors %>%
    as_tibble()
#> # A tibble: 91 x 7
#>    Distance edge_betweenness fast_greedy infomap label_prop louvain
#>    <chr>    <chr>            <chr>       <chr>   <chr>      <chr>  
#>  1 acf      Cluster_1        Cluster_1   Cluste… Cluster_1  Cluste…
#>  2 additiv… Cluster_1        Cluster_1   Cluste… Cluster_1  Cluste…
#>  3 ar.lpc.… Cluster_1        Cluster_1   Cluste… Cluster_1  Cluste…
#>  4 ar.mah   Cluster_1        Cluster_1   Cluste… Cluster_1  Cluste…
#>  5 ar.pic   Cluster_1        Cluster_1   Cluste… Cluster_1  Cluste…
#>  6 avg      Cluster_2        Cluster_2   Cluste… Cluster_2  Cluste…
#>  7 braun_b… Cluster_1        Cluster_1   Cluste… Cluster_3  Cluste…
#>  8 bray     Cluster_3        Cluster_3   Cluste… Cluster_4  Cluste…
#>  9 canberra Cluster_3        Cluster_3   Cluste… Cluster_4  Cluste…
#> 10 canberr… Cluster_1        Cluster_1   Cluste… Cluster_1  Cluste…
#> # ... with 81 more rows, and 1 more variable: walktrap <chr>

We will now create a stand-alone visualization of our graph. We already ran trellis_viz() to get a taste of how each 2D graph layout in precise_viz() would look, and in this instance we think the graphopt layout looks nicest. So, we will call precise_viz() using our graph from before as the data input and with plot_type = “graphopt_2d_graph”:

cell_cycle_correlation_viz <- precise_viz(
  data = cell_cycle_correlation_graph,
  plot_type = "graphopt_2d_graph",
  k = 5,
  color_vec = NULL,
  colors = NULL,
  size = 0.5,
  graphml = NULL,
  html =  NULL,
  verbose = FALSE
)
cell_cycle_correlation_viz$visual_output

This looks good to us, so we will now pass the layout as the data argument to trellis_descriptors() (we don’t need to do anything else here because trellis_descriptors() will know that the data input is coming from precise_viz() and will extract the plot layout accordingly), and we will use the descriptors we created above as the input into the descriptors parameter:

trellis_descriptors(
  data = cell_cycle_correlation_viz,
  descriptors = descriptors,
  path = "/absolute_path/to_somewhere/with_full_name/not_including_a/file_extension",
  name = "clusters",
  group = "clusters",
  size = 0.5,
  rank = FALSE,
  self_contained = FALSE,
  desc = "",
  md_desc = "",
  height = 500,
  width = 500,
  nrow = 1,
  ncol = 1,
  verbose = TRUE
)

Voila! We can now take a look at how each of the clusterings map onto our graphopt visualization of our distance correlation matrix. Now that we have this information of which distances are most closely related, it is up to us to choose how to use this information. Do we want to fuse distances that are very similar or the distances that are different or something else? This is a personal choice, but let’s continue our exercise by deciding that cluster 4 from the fast_greedy algorithm looks like the distances we want to keep. To do this, we can simply take our descriptors data frame and filter it like this:

keep_string <- descriptors %>%
  filter(fast_greedy == "Cluster_4") %>%
  select(Distance) %>%
  .[[1]] %>%
  as.character()

We named the output keep_string here because that is exactly how we can use it. If we want to filter our original 94 distances we calculated in the beginning for just the distances present in cluster 4 of fast_greedy, it is now as simple as this:

cell_cycle_keep_dists <- cell_cycle_all_dists %>%
  precise_transform(return_list = TRUE) %>%
  precise_transform(keep_string = keep_string)

Alternatively, we can use our keep_string (now a misnomer) to filter our original dataset so none of the distances in cluster 4 of fast_greedy are kept:

cell_cycle_keep_dists <- cell_cycle_all_dists %>%
  precise_transform(return_list = TRUE) %>%
  precise_transform(filter_string = keep_string)

Brian Muchmore

2018-09-26

Introduction

Data set-up

Contents